Restricted Boltzmann Machine Based Spectrum Modeling and Unit Selection Speech Synthesis Method<sup>*</sup>

doi:10.16451/j.cnki.issn1003-6059.201508001

Abstract A spectrum modeling and unit selection speech synthesis method based on the restricted Boltzmann machine (RBM) is proposed. In the model training stage, an RBM models detail-rich spectral features, such as spectral envelopes and short-time spectral amplitudes, in place of the single Gaussian with diagonal covariance over mel-cepstral features used as the spectral model in the traditional approach. This improves the acoustic model's ability to describe spectral features. In the speech synthesis stage, the RBM is used to compute the log-likelihood of the spectral features of each candidate unit, and a piecewise linear mapping is proposed to construct the target cost function for unit selection. Experimental results indicate that the proposed method effectively improves the naturalness of synthetic speech.
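As a rough illustration of the synthesis-stage idea, the following Python sketch scores candidate spectral frames with an RBM and maps the scores to a cost through a piecewise linear function. Everything concrete here is an assumption made for illustration and is not taken from the paper: the unit-variance Gaussian-Bernoulli form of the RBM, the function names `rbm_score` and `piecewise_linear_cost`, and the knot values. The score is the negative free energy, which equals the log-likelihood only up to the unknown log partition constant.

```python
import numpy as np


def rbm_score(v, W, a, b):
    """Unnormalized log-likelihood (negative free energy) of visible vectors.

    Assumes a Gaussian-Bernoulli RBM with unit visible variance
    (an illustrative choice, not confirmed by the source).
    v: (n_frames, n_visible), W: (n_visible, n_hidden),
    a: (n_visible,) visible biases, b: (n_hidden,) hidden biases.
    """
    quadratic = 0.5 * np.sum((v - a) ** 2, axis=-1)
    hidden_input = v @ W + b
    # softplus(x) = log(1 + exp(x)), computed stably with logaddexp
    softplus = np.logaddexp(0.0, hidden_input)
    return -quadratic + np.sum(softplus, axis=-1)


def piecewise_linear_cost(score, knots_x, knots_y):
    """Map scores to costs by linear interpolation between knots.

    knots_x must be increasing; scores outside the knot range are
    clamped to the end values. The knots are hypothetical.
    """
    return np.interp(score, knots_x, knots_y)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_visible, n_hidden = 4, 3
    W = 0.1 * rng.standard_normal((n_visible, n_hidden))
    a = np.zeros(n_visible)
    b = np.zeros(n_hidden)

    candidates = rng.standard_normal((5, n_visible))  # 5 candidate frames
    scores = rbm_score(candidates, W, a, b)
    # Higher score (more likely frame) -> lower target cost.
    costs = piecewise_linear_cost(scores, [-10.0, 0.0, 10.0], [5.0, 1.0, 0.0])
    print(scores, costs)
```

In this sketch a higher score yields a lower cost, so the candidate with the smallest cost would be preferred. How the paper actually places the segments of its mapping is not stated in the abstract.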

Key words: Speech Synthesis; Unit Selection; Hidden Markov Model; Restricted Boltzmann Machine

Received: 25 April 2014

CLC number (ZTFLH): TN 912.33

Cite this article:

SONG Yang, LING Zhen-Hua, DAI Li-Rong. Restricted Boltzmann Machine Based Spectrum Modeling and Unit Selection Speech Synthesis Method[J]. Pattern Recognition and Artificial Intelligence, 2015, 28(8): 673-679.

URL:

http://manu46.magtech.com.cn/Jweb_prai/EN/10.16451/j.cnki.issn1003-6059.201508001 OR http://manu46.magtech.com.cn/Jweb_prai/EN/Y2015/V28/I8/673
